Intelligent Self-repairable Web Wrappers

Authors

  • Emilio Ferrara
  • Robert Baumgartner
Abstract

The amount of information available on the Web grows at an incredibly high rate. Systems and procedures for extracting these data from Web sources already exist, and different approaches and techniques have been investigated in recent years. On the one hand, reliable solutions should provide robust Web data mining algorithms that can automatically cope with malfunctions or failures. On the other hand, the literature lacks solutions for maintaining these systems. Procedures that extract Web data may be tightly coupled to the structure of the data source itself; thus, malfunctions or the acquisition of corrupted data can be caused, for example, by structural modifications that owners make to their data sources. Today, data integrity verification and maintenance are mostly handled manually to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to creating procedures that extract data from Web sources (so-called Web wrappers) which can cope with malfunctions caused by modifications to the structure of the data source and can automatically repair themselves.

Similar resources

Reconfigurable Web Wrapper Agents

directly access the data. Web wrappers, however, must automate Web browsing sessions to extract data from the target Web pages so other applications can process that data. Each Web site has its own set of links, layout templates, and syntax. You could, in a brute-force solution, program a wrapper for each browsing session. However, such wrappers are sensitive to Web site changes and become diff...

Building intelligent Web applications using lightweight wrappers

The Web so far has been incredibly successful at delivering information to human users. So successful actually, that there is now an urgent need to go beyond a browsing human. Unfortunately, the Web is not yet a well organized repository of nicely structured documents but rather a conglomerate of volatile HTML pages. To address this problem, we present the World Wide Web Wrapper Factory (W4F), ...

Web Wrapper Specification Using Compound Filter Learning

Information available on the Internet is made to be read by humans, not to be processed by machines. To automatically access this information, there is a need for intelligent services that convert HTML documents into more suitable formats like XML. This can be achieved through generation of Web wrappers, programs designed to process pages of a given Web site. To generate such Web wrappers, an e...

Intelligent Wrapping of Information Sources: Getting Ready for the Electronic Market

Literature search and delivery on the World Wide Web is becoming a rapidly expanding market. Up to now, search has been mostly cost-free, but in the future we expect more and more providers to charge for their services. The main problems are finding the right provider and extracting the information. UniCats is a system for intelligent information search and extraction from multiple pr...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Publication date: 2011